Add Regression Tests #85

Merged
WilliamYue37 merged 11 commits into main from feat/reg_tests
Jan 26, 2026

WilliamYue37 commented Jan 23, 2026

What this does

Adds GPU regression tests. The tests train the model, resume training, and run inference.

How it was tested

I ran the workflow on GitHub Actions; see https://github.com/TensorAuto/OpenTau/actions/runs/21304126877/job/61328345455?pr=85

How to checkout & try? (for the reviewer)

Dispatch the workflow manually on GitHub Actions.

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

WilliamYue37 self-assigned this Jan 23, 2026
shuheng-liu previously approved these changes Jan 25, 2026

shuheng-liu left a comment

Looks good to me

Copilot AI left a comment

Pull request overview

Adds a nightly (and manually dispatchable) GPU regression workflow that trains a model, runs a set of log-based sanity checks, converts a checkpoint, and runs inference. Also includes a few supporting config and doc tweaks.

Changes:

  • Add a new GitHub Actions workflow to run GPU regression training + inference with log validators.
  • Add helper scripts to validate training signals (loss drop, grad norm, grad sync, state dict keys).
  • Reduce distributed log spam by gating from_pretrained prints to the main process; update docs/configs for CI usage.
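
A loss-drop validator of the kind listed above could be sketched as follows. This is only an illustration, not the code in this PR: the function names `moving_average` and `loss_dropped`, the window size, and the sample losses are all hypothetical.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Return the simple moving average over a sliding window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def loss_dropped(losses: List[float], window: int = 3) -> bool:
    """True if the last smoothed loss is below the first smoothed loss."""
    smoothed = moving_average(losses, window)
    return smoothed[-1] < smoothed[0]


# Hypothetical noisy but decreasing losses, as might be parsed from a log.
losses = [2.0, 2.2, 1.8, 1.5, 1.6, 1.2]
print(loss_dropped(losses))
```

Smoothing before comparing keeps a single noisy step from failing the check; the right window size depends on how many steps the CI run logs.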

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/opentau/policies/pi05/modeling_pi05.py | Gate verbose loading/remapping prints to main process in distributed runs. |
| docs/source/tutorials/inference.rst | Update inference command to point at OpenTau’s inference script. |
| configs/examples/accelerate_deepspeed_config.yaml | Adjust example accelerate config process count (used by regression workflow). |
| configs/dev/ci_config.json | Update CI training config to use pi05 + smaller action chunking and CI-specific settings. |
| .github/workflows/regression_test.yml | Add nightly GPU regression workflow (start runner, train, validate logs, convert, infer, stop runner). |
| .github/workflows/gpu_test.yml | Update GPU runner ASG name and reduce timeout. |
| .github/scripts/utils.py | Add shared grep_file helper for log parsing. |
| .github/scripts/check_state_keys.py | Add validator for missing/unexpected state dict keys in logs. |
| .github/scripts/check_nonzero_grad_norm.py | Add validator ensuring grad norm is present and non-zero. |
| .github/scripts/check_loss_drop.py | Add validator ensuring (smoothed) loss decreases and resume behavior is sane. |
| .github/scripts/check_accumulate_grad_sync.py | Add validator ensuring accelerator.sync_gradients matches grad accumulation cadence. |
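
To illustrate the log-based approach these validators share, here is a minimal sketch of a grep-style helper plus a non-zero grad norm check. It is not the `grep_file` helper from this PR: `grep_lines`, the regex, and the log line format are assumptions made up for the example, and it reads from an in-memory list rather than a file.

```python
import re
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def grep_lines(
    lines: Iterable[str], pattern: str, processor: Callable[[str], T]
) -> List[T]:
    """Apply `processor` to the first capture group of every matching line."""
    regex = re.compile(pattern)
    results: List[T] = []
    for line in lines:
        match = regex.search(line)
        if match:
            results.append(processor(match.group(1)))
    return results


# Hypothetical log lines; the real training log format may differ.
log = [
    "step=1 loss=2.10 grad_norm=0.83",
    "saving checkpoint",
    "step=2 loss=1.95 grad_norm=0.71",
]

grad_norms = grep_lines(log, r"grad_norm=([0-9.eE+-]+)", float)
assert grad_norms, "no grad norm found in log"
assert all(g > 0 for g in grad_norms), "grad norm should be non-zero"
print(grad_norms)
```

Checking that the list is non-empty before checking the values matters: `all()` over an empty list is true, so a log with no matching lines would otherwise pass silently.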

Comment on lines +32 to +36
sync_grads = grep_file(arg.log_path, arg.re_pattern, processor=bool)
assert len(sync_grads) == arg.expected_length, (
    f"Expected {arg.expected_length} sync_gradients, found {len(sync_grads)} in {arg.log_path}."
)
assert all(sg == ((i + 1) % arg.gradient_accumulation_steps == 0) for i, sg in enumerate(sync_grads)), (

Copilot AI Jan 25, 2026

processor=bool will interpret both "True" and "False" strings as True (because any non-empty string is truthy), so every entry in sync_grads will be True and the cadence assertion will not check what it is meant to check. Convert explicitly from the captured string (e.g., map "True" to True and "False" to False) before running the pattern check.
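
The pitfall and one possible fix can be sketched as follows. The names `parse_bool` and `follows_cadence` and the sample captures are hypothetical; the sketch assumes the helper passes the captured string to `processor`, as the comment above does.

```python
from typing import List


def parse_bool(text: str) -> bool:
    """Map the literal strings "True"/"False" to booleans; reject anything else."""
    mapping = {"True": True, "False": False}
    if text not in mapping:
        raise ValueError(f"expected 'True' or 'False', got {text!r}")
    return mapping[text]


def follows_cadence(sync_grads: List[bool], gradient_accumulation_steps: int) -> bool:
    """True if gradients sync exactly on every N-th step (1-indexed)."""
    return all(
        sg == ((i + 1) % gradient_accumulation_steps == 0)
        for i, sg in enumerate(sync_grads)
    )


# Hypothetical strings captured from a log with gradient_accumulation_steps=2.
captured = ["False", "True", "False", "True"]

naive = [bool(s) for s in captured]          # every non-empty string is truthy
explicit = [parse_bool(s) for s in captured]

print(naive)
print(explicit)
print(follows_cadence(naive, 2), follows_cadence(explicit, 2))
```

Raising on unexpected input, rather than defaulting to `False`, makes a changed log format fail loudly instead of producing a misleading cadence result.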

Comment on lines +99 to +105
- name: Set up Libero Configs
  shell: bash
  run: |
    source .venv/bin/activate
    mkdir -p /tmp/libero-assets/libero/libero
    export LIBERO_CONFIG_PATH="$(pwd)/.github/assets/libero"

Copilot AI Jan 25, 2026

In GitHub Actions, export LIBERO_CONFIG_PATH=... inside a run step only affects that step's shell; it won’t persist to later steps like "Run Training". If the training/inference needs this env var, write it to $GITHUB_ENV (or set it under the job/step env:) so it’s available in subsequent steps.
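
GitHub Actions exposes `GITHUB_ENV` as the path of a file; appending a `NAME=value` line to it makes the variable available to later steps in the same job. A minimal sketch of doing this from a Python step follows. The helper `persist_env` and its optional `env_file` argument are hypothetical and exist only so the example can run outside Actions against a temporary file.

```python
import os
import tempfile
from typing import Optional


def persist_env(name: str, value: str, env_file: Optional[str] = None) -> None:
    """Append NAME=value to the GITHUB_ENV file (single-line values only)."""
    path = env_file or os.environ["GITHUB_ENV"]
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


# Demo outside GitHub Actions: write to a temporary stand-in for $GITHUB_ENV.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "github_env")
    config_path = os.path.join(os.getcwd(), ".github/assets/libero")
    persist_env("LIBERO_CONFIG_PATH", config_path, demo_file)
    with open(demo_file, encoding="utf-8") as fh:
        contents = fh.read()

print(contents)
```

In a bash step the equivalent is `echo "LIBERO_CONFIG_PATH=$(pwd)/.github/assets/libero" >> "$GITHUB_ENV"`. The new value is visible only to subsequent steps, not to the step that writes it.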

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
WilliamYue37 merged commit 4fa25eb into main Jan 26, 2026
5 checks passed
WilliamYue37 deleted the feat/reg_tests branch January 26, 2026 19:41
Labels: feature (New feature or request)

3 participants